Window-based Stereo Matching¶

Stereo matching plays a pivotal role in 3D computer vision by computing the disparity map between a pair of stereo images. Nowadays, traditional methods, like the one presented in this notebook, have largely fallen out of favor in practical applications, primarily due to the emergence of superior neural network-based algorithms.

Nevertheless, diving into the fundamentals of stereo matching offers a valuable opportunity to grasp the foundational aspects of this field and gain insights into potential pitfalls and challenges that persist in stereo matching algorithms. These challenges encompass issues such as occlusion, variations in image brightness, and image noise, which are prevalent in real-world scenarios. Familiarizing myself with these challenges serves as a building block for developing intuition on how to tackle them when working with more advanced neural network-based approaches. 😊

RULES: As usual, OpenCV is banned in this repository.

Introduction¶

Stereo matching aims to determine pixel disparities along the x-axis between two provided images: a left image and a right image. It assumes that the y-axes of these stereo images are perfectly aligned. However, in reality, this assumption often doesn't hold true, necessitating stereo image rectification. The accuracy of this rectification process largely depends on locating feature points in both stereo images. For this exercise, we assume perfect y-axis alignment, allowing us to proceed with this tutorial.

To clarify, in the context of the stereo images provided below, the left image is captured from the left viewpoint, and the right image is captured from the right viewpoint. Because the two viewpoints are horizontally offset, objects in the left image appear shifted to the right relative to their counterparts in the right image.

Following the disparity map convention, the disparity value is computed by subtracting the x-coordinate of a pixel in the right image from the x-coordinate of the corresponding pixel in the left image, as shown in the equation:

$$ \text{disparity} = \text{pixel}_{\text{left}} - \text{pixel}_{\text{right}} $$

For instance, if we aim to determine the disparity value for a pixel, such as the eyes of the statue, in the left image, the disparity value would be positive, indicating that this pixel is to the right of its corresponding point in the right image.

It's important to note that if we reverse the order of the left and right images, the sign of the disparity will be inverted.
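As a tiny numeric sanity check of this sign convention (the coordinates below are made up for illustration):

```python
# Made-up x-coordinates of the same scene point in both images.
x_left, x_right = 120, 100
disparity = x_left - x_right  # positive: the point sits further right in the left image
print(disparity)  # 20

# Swapping the roles of the two images flips the sign.
print(x_right - x_left)  # -20
```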

Left Image

The statue appears more to the right in the left image compared to its position in the right image.

Right Image

The right image is captured from the right viewpoint.

Note: In this notebook, all discussions and equations follow the conventions commonly used in computer vision for defining image coordinates. The x-axis extends towards the right, and the y-axis extends downward. However, for implementation, it's crucial to reverse the order of the parameters. This is because, in a numpy array, the first axis corresponds to the downward direction, while the second axis corresponds to the rightward direction.
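A quick illustration of this index order, using a made-up array:

```python
import numpy as np

height, width = 4, 6
img = np.zeros((height, width))  # axis 0 = y (down), axis 1 = x (right)

# The pixel at image coordinates (x=5, y=2) is indexed as img[y, x].
x, y = 5, 2
img[y, x] = 1.0
print(img.shape)  # (4, 6)
```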

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from skimage import io
import image_processing_utils

# 0-1
gray_left = image_processing_utils.read_image("./input/pair0-L.png")
gray_right = image_processing_utils.read_image("./input/pair0-R.png")
statue_left = image_processing_utils.read_image("./input/pair1-L.png")
statue_right = image_processing_utils.read_image("./input/pair1-R.png")
cleaner_left = image_processing_utils.read_image("./input/pair2-L.png")
cleaner_right = image_processing_utils.read_image("./input/pair2-R.png")
# 0-255
statue_left_gt = io.imread("./input/pair1-D_L.png")
statue_right_gt = io.imread("./input/pair1-D_R.png")
cleaner_left_gt = io.imread("./input/pair2-D_L.png")
cleaner_right_gt = io.imread("./input/pair2-D_R.png")

Energy Minimization¶

One of the approaches within stereo matching involves framing the problem as an energy function minimization task. This energy function comprises two fundamental components: the data term and the smoothness term, which can be formally expressed as:

$$ E(D) = \alpha E_{data}(D) + \beta E_{smooth}(D) $$

The data term, denoted as $E_{data}(D)$, can be informally described as follows:

$$ E_{data}(D) = \sum_{(x,y)\in{I}} C(x, y, D(x, y)) $$

Here, it quantifies the cost associated with the disparity values $D(x, y)$ for all pixel coordinates $(x, y)$ within the image $I$.

The smoothness term, on the other hand, is defined as:

$$ E_{smooth}(D) = \sum_{(p,q)\in{\epsilon}} V(d_p, d_q) $$

This term captures smoothness constraints by evaluating the relationship between disparities $d_p$ and $d_q$ for pixel pairs belonging to a defined neighborhood $\epsilon$. In this context, $\epsilon$ represents the set of neighboring pixel pairs, and $V(d_p, d_q)$ signifies the $L1$ distance between the two disparities. To illustrate, given a disparity value $d_p$ for a pixel $p$ in the disparity map $D$, the $d_q$ values are the disparities of all pixels neighboring $p$.
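As a minimal sketch (not part of the window-based matcher implemented below), the $L1$ smoothness term over a 4-connected neighborhood can be computed from a disparity map like this:

```python
import numpy as np


def smoothness_energy(disp_map):
    """L1 smoothness term over a 4-connected neighborhood:
    sums |d_p - d_q| for horizontally and vertically adjacent pixels."""
    horiz = np.abs(np.diff(disp_map, axis=1)).sum()
    vert = np.abs(np.diff(disp_map, axis=0)).sum()
    return horiz + vert


# A constant disparity map has zero smoothness cost.
flat = np.full((3, 3), 5.0)
print(smoothness_energy(flat))  # 0.0
```

A window-based matcher like the one below effectively sets $\beta = 0$ and minimizes only the data term, which is why its output tends to be noisy.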

It's important to note that we perform stereo matching in both the left-to-right and right-to-left directions. This is necessary due to variations in depth (discontinuities) and occlusions. In other words, some pixels that are visible in the left image may not be visible in the right image, and vice versa.
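One common way to combine the two directions is a left-right consistency check: the left-to-right disparity at a pixel should be (approximately) the negation of the right-to-left disparity at the matched location, and pixels violating this are likely occluded. A sketch, assuming both maps have already been computed:

```python
import numpy as np


def lr_consistency_mask(disp_lr, disp_rl, tol=1.0):
    """Return a boolean mask that is True where the left-to-right
    disparity agrees (within `tol`) with the right-to-left disparity
    at the matched position. False pixels are likely occluded."""
    height, width = disp_lr.shape
    mask = np.zeros((height, width), dtype=bool)
    for y in range(height):
        for x in range(width):
            d = int(round(disp_lr[y, x]))
            xr = x - d  # matched x-coordinate in the right image
            if 0 <= xr < width:
                mask[y, x] = abs(disp_lr[y, x] + disp_rl[y, xr]) <= tol
    return mask
```

Pixels failing the check are typically marked invalid and filled in from their neighbors rather than kept with a spurious disparity.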

In [ ]:
def cal_disp_map(
    left_img, right_img, window_size=3, disp_range=(-32, 32), cost_fun=None
):
    assert cost_fun is not None
    assert left_img.shape == right_img.shape
    # disparity could be positive or negative floating-point.
    disparity_map = np.zeros_like(left_img, dtype=np.float64)

    half_win = window_size // 2  # 3->1, 5->2
    height, width = left_img.shape  # Get the image dimensions

    # Loop through each pixel in the left image.
    for y in range(height):
        for x in range(width):
            # Define the (inclusive) search window coordinates.
            y_min, y_max = max(0, y - half_win), min(height - 1, y + half_win)
            x_min, x_max = max(0, x - half_win), min(width - 1, x + half_win)
            assert image_processing_utils.coord_is_valid(
                x_min, x_max, y_min, y_max, width, height
            ), f"{x_min}, {x_max}, {y_min}, {y_max}, {width}, {height}"

            left_patches = []
            right_patches = []
            disp_candidates = []
            # Calculate the cost for each disparity value in the disparity range.
            for disp in range(disp_range[0], disp_range[1]):
                # Shift the right window along the x-axis by the disparity.
                if not image_processing_utils.coord_is_valid(
                    x_min - disp, x_max - disp, y_min, y_max, width, height
                ):
                    continue

                # Extract the left and right image patches
                left_patch = left_img[y_min : y_max + 1, x_min : x_max + 1]
                right_patch = right_img[
                    y_min : y_max + 1, x_min - disp : x_max - disp + 1
                ]
                left_patches.append(left_patch)
                right_patches.append(right_patch)
                disp_candidates.append(disp)

            costs = cost_fun(
                np.stack(left_patches, axis=0), np.stack(right_patches, axis=0)
            )
            assert not np.any(np.isnan(costs))
            assert costs.size == len(disp_candidates)
            disparity_map[y, x] = disp_candidates[np.argmin(costs)]

    return disparity_map

Sum of Squared/Absolute Difference and Cosine Similarity¶

Our primary goal is to find the disparity map $D$ that minimizes $E(D)$. To achieve this, we define the cost function for each pixel $(x, y)$ as follows:

$$ c(x, y, d) = \sum_{i=-\frac{W}{2}}^{\frac{W}{2}} \sum_{j=-\frac{W}{2}}^{\frac{W}{2}} \text{Diff}(I_{\text{left}}(x+i, y+j), I_{\text{right}}(x+i-d, y+j)) $$

Here, $d$ spans a predefined range and $W$ represents the width and height of the target window. Our aim is to find the optimal disparity value that minimizes the cost $c(x, y, d)$ for the pixel $(x, y)$.

The $\text{Diff}(a, b)$ function quantifies the dissimilarity between two vectors $a$ and $b$. We employ three methods to assess the magnitude of differences between two multi-dimensional vectors. These methods are as follows:

Sum of Squared Differences (SSD):

$$ [I_{left}(x+i, y+j) - I_{right}(x+i-d, y+j)]^2 $$

Sum of Absolute Differences (SAD):

$$ |I_{left}(x+i, y+j) - I_{right}(x+i-d, y+j)| $$

Cosine Similarity:

$$ \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} (A_i \cdot B_i)}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}} $$

Conventionally, SSD and SAD are two functions that measure the dissimilarity of two vectors, with a smaller cost indicating a better match. However, in the case of cosine similarity, a larger value indicates that two vectors are more similar. Therefore, we need to invert the sign in the implementation.

We are now prepared to implement SSD, SAD, and cosine similarity functions. To validate our implementations, we will employ a toy example.

In [ ]:
def ssd(left, right):
    """Sum of Squared Difference
    left, right: C x H x W"""
    assert left.shape == right.shape and left.ndim == 3
    return np.sum((left - right) ** 2, axis=(1, 2))


def sad(left, right):
    """Sum of Absolute Difference
    left, right: C x H x W"""
    assert left.shape == right.shape and left.ndim == 3
    return np.sum(np.absolute(left - right), axis=(1, 2))


def cosine_similarity(left, right):
    """Cosine Similarity
    left, right: C x H x W"""
    assert left.shape == right.shape and left.ndim == 3
    dividend = np.sum(np.multiply(left, right), axis=(1, 2))
    assert dividend.shape[0] == left.shape[0]
    divisor = np.multiply(
        np.sqrt(np.sum(left**2, axis=(1, 2))),
        np.sqrt(np.sum(right**2, axis=(1, 2))),
    )
    assert divisor.shape[0] == left.shape[0]
    assert np.all(divisor != 0), f"divisor has 0 in it. {divisor}"
    return -1 * np.divide(dividend, divisor)


image_data = [(ssd, "SSD"), (sad, "SAD"), (cosine_similarity, "Cosine Similarity")]

fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(15, 3))
fig.suptitle("Toy stereo images and left-to-right disparity map.")
ax[0].set_title("Left image")
ax[0].imshow(gray_left, cmap="gray", aspect="equal")
ax[1].set_title("Right image")
ax[1].imshow(gray_right, cmap="gray", aspect="equal")

for i, (cost_fun, title) in enumerate(image_data):
    disp_map = cal_disp_map(
        gray_left, gray_right, window_size=3, disp_range=(-5, 5), cost_fun=cost_fun
    )

    ax[i + 2].set_title(f"{title}")
    im = ax[i + 2].imshow(disp_map, cmap="RdYlGn", aspect="equal", vmin=-5, vmax=5)

    cbar = fig.colorbar(im, ax=ax[i + 2], orientation="vertical")

fig.tight_layout()
plt.show()

Discussion:

In our toy example, the black square in the left image is approximately two pixels to the right of the corresponding square in the right image. Consequently, we anticipate disparity values of approximately two in the left-to-right disparity map. Additionally, it's essential to note that both stereo images must have the same size, and the output disparity map will match the size of the input images.

With this pipeline validated, we are ready to apply it to more realistic stereo images.

Defining the Optimal Disparity Search Range¶

Before applying our stereo matching pipeline to "realistic" images, it's crucial to address how we define the search range for the disparity map, encompassing both the minimum and maximum disparities. In our toy example and "realistic" images with known ground truth disparity maps, determining the optimal range is straightforward. However, in real-world scenarios, obtaining this information is challenging due to unpredictable object sizes and varying distances from the camera. For instance, in a self-driving car's context, cars and pedestrians can have varying distances, leading to diverse disparity values.

To tackle this challenge, one approach is to employ feature matching algorithms like SIFT to locate and match feature points in stereo images. Once we identify matching feature points in the left and right images, we can estimate the initial disparity range by measuring the x-axis distance between these points (possibly with some margins). As the number of paired feature points in the stereo image increases, these points tend to scatter across the four corners of the image, providing a more accurate estimate of the disparity search range.
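The range estimate itself is simple once matched feature points are available. The keypoint arrays below are hypothetical stand-ins for a matcher's output:

```python
import numpy as np


def estimate_disp_range(pts_left, pts_right, margin=5):
    """Estimate (disp_min, disp_max) from matched keypoints.
    pts_left, pts_right: N x 2 arrays of (x, y) coordinates, where
    row i of each array is the same scene point in both images."""
    disparities = pts_left[:, 0] - pts_right[:, 0]
    lo = int(np.floor(disparities.min())) - margin
    hi = int(np.ceil(disparities.max())) + margin
    return lo, hi


# Hypothetical matched keypoints, one (x, y) pair per row.
pts_left = np.array([[50.0, 10.0], [120.0, 40.0], [200.0, 80.0]])
pts_right = np.array([[45.0, 10.0], [100.0, 40.0], [170.0, 80.0]])
print(estimate_disp_range(pts_left, pts_right))  # (0, 35)
```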

Moreover, it's important to note that the search range for disparity values can encompass both negative and positive values. In 3D computer vision, stereo images are typically rectified using the fundamental matrix. As a result, disparity values can vary across the entire range, including both negative and positive values, rather than being limited to exclusively negative or positive values. This flexibility is essential to accurately represent disparities in the real world.

However, for the purposes of this tutorial, we will simplify the process by using the disparity values from the ground truth image to set disp_min and disp_max.

In [ ]:
disp_min = 0
disp_max = 150
In [ ]:
image_data = [
    (ssd, "SSD", "equal"),
    (sad, "SAD", "equal"),
    (cosine_similarity, "Cosine Similarity", "equal"),
]

fig, ax = plt.subplots(nrows=1, ncols=4, figsize=(18, 4))
fig.suptitle("Ground Truth and Left-to-right disparity map.")

ax[0].set_title(
    f"Ground Truth. Min:{np.min(statue_left_gt)}. Max:{np.max(statue_left_gt)}"
)
ax[0].imshow(statue_left_gt / 2, cmap="RdYlGn", aspect="auto", vmin=0, vmax=150)
ax[0].set_axis_off()


for i, (cost_fun, title, aspect) in enumerate(image_data):
    disp_map = cal_disp_map(
        statue_left,
        statue_right,
        window_size=5,
        disp_range=(disp_min, disp_max),
        cost_fun=cost_fun,
    )
    ax[i + 1].set_title(f"{title}. Min:{np.min(disp_map)}. Max:{np.max(disp_map)}")
    im = ax[i + 1].imshow(disp_map, cmap="RdYlGn", aspect=aspect, vmin=0, vmax=150)
    ax[i + 1].set_axis_off()

cbar = fig.colorbar(im, ax=ax[0], orientation="vertical")
cbar.set_label("Disparity Value", rotation=270, labelpad=20)

fig.tight_layout()
plt.show()

Discussion:

  • SSD and SAD produce nearly identical results, with cosine similarity slightly outperforming them. This improvement can be attributed to cosine similarity's ability to assess vector direction rather than merely focusing on magnitude differences. Therefore, we will exclusively compare SSD and cosine similarity in the ongoing experiments.

  • Occlusion: Along the left edge of the image, we observe extremely small disparity values, which are unrealistic and indicative of occlusion. Ideally, occluded pixels should be detected explicitly so they can be marked as invalid rather than assigned a spurious disparity.

  • Incorrect Disparity in Ground Truth: An unusual observation is that the ground-truth values appear to be stored at twice the true disparity. Measuring the shift manually in an image viewer yields only half the value recorded in the ground-truth image, which is why the ground truth is divided by 2 before plotting.
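To make comparisons like these less subjective, a standard metric is the bad-pixel rate: the fraction of pixels whose disparity deviates from the ground truth by more than a threshold. A sketch, with a hypothetical `gt_scale` parameter to compensate for ground-truth maps stored at twice the true disparity:

```python
import numpy as np


def bad_pixel_rate(disp_map, gt, threshold=3.0, gt_scale=0.5):
    """Fraction of pixels with |disparity - scaled ground truth| > threshold.
    `gt_scale` compensates for ground-truth maps that store twice the
    true disparity, as observed for these image pairs."""
    err = np.abs(disp_map - gt.astype(np.float64) * gt_scale)
    return float(np.mean(err > threshold))


# Toy check: a map identical to the (scaled) ground truth scores 0.
toy = np.full((4, 4), 10.0)
print(bad_pixel_rate(toy, toy * 2))  # 0.0
```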

Next, we will apply the same pipeline to another realistic image, as demonstrated below.

In [ ]:
image_data = [
    (ssd, "SSD", "equal"),
    (sad, "SAD", "equal"),
    (cosine_similarity, "Cosine Similarity", "equal"),
]

fig, ax = plt.subplots(nrows=1, ncols=4, figsize=(18, 4))
fig.suptitle("Ground Truth and Left-to-right disparity map.")

ax[0].set_title(
    f"Ground Truth. Min:{np.min(cleaner_left_gt)}. Max:{np.max(cleaner_left_gt)}"
)
ax[0].imshow(cleaner_left_gt / 2, cmap="RdYlGn", aspect="auto", vmin=0, vmax=150)
ax[0].set_axis_off()


for i, (cost_fun, title, aspect) in enumerate(image_data):
    disp_map = cal_disp_map(
        cleaner_left,
        cleaner_right,
        window_size=5,
        disp_range=(disp_min, disp_max),
        cost_fun=cost_fun,
    )
    ax[i + 1].set_title(f"{title}. Min:{np.min(disp_map)}. Max:{np.max(disp_map)}")
    im = ax[i + 1].imshow(disp_map, cmap="RdYlGn", aspect=aspect, vmin=0, vmax=150)
    ax[i + 1].set_axis_off()

cbar = fig.colorbar(im, ax=ax[0], orientation="vertical")
cbar.set_label("Disparity Value", rotation=270, labelpad=20)

fig.tight_layout()
plt.show()

Introducing Noise and Brightness Variation¶

We are now prepared to investigate how SSD and cosine similarity perform when faced with stereo images subjected to noise and brightness variation. Specifically, the following modifications have been made to the left image, while the right image remains unaltered:

  • Gaussian Noise Addition: Gaussian noise with a mean of 0.0 and a standard deviation of 0.05 has been introduced to the left image.

  • Brightness Enhancement: The brightness of the left image has been increased gradually from left to right by up to 30%. In other words, if the left image pixel values range from 0 to 1, we have added an offset that grows linearly from 0 at the left edge to 0.3 at the right edge.

We will now examine the impact of these adjustments as follows:

In [ ]:
def add_noise(image, noise_mean=0, noise_std=0.05):
    # Generate random white noise.
    noise = np.random.normal(loc=noise_mean, scale=noise_std, size=image.shape)

    noisy_image = np.clip(image + noise, 0, 1)
    return noisy_image


def brightness(shape, minmax=(0, 0.1)):
    """Create a horizontal ramp rising linearly from minmax[0] to minmax[1]."""
    height, width = shape[0], shape[1]
    ret = np.zeros((height, width))
    step = (minmax[1] - minmax[0]) / width
    for i in range(width):
        ret[:, i] = minmax[0] + step * i
    return ret


def vary_brightness(image, ratio=0.1):
    image_min = np.min(image)
    image_max = np.max(image)
    image_range = image_max - image_min

    brightness_variation = image_range * ratio
    tmp = brightness(image.shape, minmax=(0, brightness_variation))
    adjusted_image = np.clip(image + tmp, 0, 1)
    return adjusted_image


image_data = [(statue_left, "Statue image"), (cleaner_left, "Cleaner image")]

fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(13, 7))
ax[0, 0].set_title(f"Original image")
ax[0, 1].set_title("Gaussian noise")
ax[0, 2].set_title("Brightness variation")
for i, (image, title) in enumerate(image_data):
    noise_image = add_noise(image, noise_mean=0, noise_std=0.05)
    bright_image = vary_brightness(image, ratio=0.3)
    ax[i, 0].imshow(image, cmap="gray", aspect="equal")
    ax[i, 0].set_axis_off()
    ax[i, 1].imshow(noise_image, cmap="gray", aspect="equal")
    ax[i, 1].set_axis_off()
    ax[i, 2].imshow(bright_image, cmap="gray", aspect="equal")
    ax[i, 2].set_axis_off()
fig.tight_layout()
plt.show()
In [ ]:
noise_statue_left = add_noise(statue_left, noise_mean=0, noise_std=0.05)
bright_statue_left = vary_brightness(statue_left, ratio=0.3)
image_data = [
    (ssd, noise_statue_left, "Gaussian noise. SSD."),
    (ssd, bright_statue_left, "Brightness variation. SSD."),
    (cosine_similarity, noise_statue_left, "Gaussian noise. Cosine similarity."),
    (cosine_similarity, bright_statue_left, "Brightness variation. Cosine similarity."),
]

fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(20, 4))
fig.suptitle(
    "Ground Truth and disparity maps with Gaussian noise and brightness variation."
)

ax[0].set_title(f"Ground Truth.")
ax[0].imshow(statue_left_gt / 2, cmap="RdYlGn", aspect="auto", vmin=0, vmax=150)
ax[0].set_axis_off()

for i, (cost_fun, img, title) in enumerate(image_data):
    disp_map = cal_disp_map(
        img,
        statue_right,
        window_size=5,
        disp_range=(disp_min, disp_max),
        cost_fun=cost_fun,
    )
    ax[i + 1].set_title(f"{title}")
    ax[i + 1].imshow(disp_map, cmap="RdYlGn", aspect="equal", vmin=0, vmax=150)
    ax[i + 1].set_axis_off()

fig.tight_layout()
plt.show()
In [ ]:
noise_cleaner_left = add_noise(cleaner_left, noise_mean=0, noise_std=0.05)
bright_cleaner_left = vary_brightness(cleaner_left, ratio=0.3)
image_data = [
    (ssd, noise_cleaner_left, "Gaussian noise. SSD."),
    (ssd, bright_cleaner_left, "Brightness variation. SSD."),
    (cosine_similarity, noise_cleaner_left, "Gaussian noise. Cosine similarity."),
    (
        cosine_similarity,
        bright_cleaner_left,
        "Brightness variation. Cosine similarity.",
    ),
]

fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(20, 4))
fig.suptitle(
    "Ground Truth and disparity maps with Gaussian noise and brightness variation."
)

ax[0].set_title(f"Ground Truth.")
ax[0].imshow(statue_left_gt / 2, cmap="RdYlGn", aspect="auto", vmin=0, vmax=150)
ax[0].set_axis_off()

for i, (cost_fun, img, title) in enumerate(image_data):
    disp_map = cal_disp_map(
        img,
        cleaner_right,
        window_size=5,
        disp_range=(disp_min, disp_max),
        cost_fun=cost_fun,
    )
    ax[i + 1].set_title(f"{title}")
    ax[i + 1].imshow(disp_map, cmap="RdYlGn", aspect="equal", vmin=0, vmax=150)
    ax[i + 1].set_axis_off()

fig.tight_layout()
plt.show()

Conclusion¶

In my humble opinion, window-based stereo matching appears to be one of the most straightforward and accessible algorithms among various stereo matching methods. Throughout this tutorial, we've gained insights into implementing window-based stereo matching using three different techniques. Additionally, we've witnessed how these methods exhibit varying performance when subjected to external factors affecting image quality, such as noise and brightness variation.

To Be Discussed¶

Normalized Cross-Correlation, occlusion handling, filtering out disparities whose matching costs exceed a threshold, and graph-cut optimization.
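Of these, Normalized Cross-Correlation is the most direct extension of what we already have: it subtracts each patch's mean before the cosine-similarity normalization, making the score invariant to additive brightness shifts as well as contrast scaling. A sketch compatible with the cost-function interface used above (the sign is inverted, as with cosine similarity):

```python
import numpy as np


def ncc(left, right, eps=1e-8):
    """Normalized Cross-Correlation cost. left, right: N x H x W stacks
    of patches. Subtracting the patch mean removes additive brightness
    shifts; dividing by the norms removes contrast scaling."""
    assert left.shape == right.shape and left.ndim == 3
    l = left - left.mean(axis=(1, 2), keepdims=True)
    r = right - right.mean(axis=(1, 2), keepdims=True)
    num = np.sum(l * r, axis=(1, 2))
    den = np.sqrt(np.sum(l**2, axis=(1, 2)) * np.sum(r**2, axis=(1, 2))) + eps
    return -num / den  # negate: smaller cost = better match
```

Under the brightness-variation experiment above, such a cost function should be noticeably more robust than SSD or plain cosine similarity, since a constant offset added to one patch leaves its mean-subtracted values unchanged.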
